
[Cherry-Pick][BugFix] fix MTP bugs in TP and overlap(#7172) #7192

Merged
freeliuzc merged 2 commits into PaddlePaddle:release/2.6 from huicongyao:MTP
Apr 8, 2026

Conversation

@huicongyao
Contributor

@huicongyao huicongyao commented Apr 3, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If no unit tests are added, explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 3, 2026

Thanks for your contribution!

@huicongyao huicongyao changed the title [Cherry-Pick][BugFix] fix MTP bugs in TP and overlap [Cherry-Pick][BugFix] fix MTP bugs in TP and overlap(#7172) Apr 3, 2026
@huicongyao
Contributor Author

/re-run approval

@huicongyao
Contributor Author

/re-run ci_xpu

@freeliuzc freeliuzc merged commit 36909bf into PaddlePaddle:release/2.6 Apr 8, 2026
93 of 106 checks passed

@fastdeploy-bot fastdeploy-bot left a comment


🤖 AI Code Review | 2026-04-09 10:55 CST

📋 Review Summary

PR overview: Cherry-Pick PR that fixes MTP (Multi-Token Prediction) bugs in the TP and overlap scenarios.

Scope of changes: speculative-decoding custom ops and the GPU model runner.

Impact tags: [Speculative Decoding] [BugFix] [OP]

📝 PR Convention Check

The PR title follows the Cherry-Pick convention, but the Motivation and Modifications sections of the description are empty and should be filled in.

Suggested description template:

## Motivation

Fix MTP bugs in the following scenarios:
1. In TP (Tensor Parallelism) mode, duplicate sampling outputs from non-rank-0 processes were sent to the message queue, wasting resources.
2. In the overlap-scheduling scenario, `_predict_next_launch_token_num()` computed the token count incorrectly.

## Modifications

1. `speculate_save_output.cc` and `speculate_save_output_with_topk.cc`: remove the `save_each_rank` condition and keep only the `rank_id > 0` check.
   - In TP mode all ranks share identical sampling outputs, so only rank 0 needs to send results.

2. `gpu_model_runner.py`: simplify the computation of `next_launch_token_num`.
   - Change `seq_lens.sum() + is_block_step.sum() * token_num_one_step` to `next_real_bsz * token_num_one_step`.

Issues

| Level | File | Summary |
| --- | --- | --- |
| ❓ Question | speculate_save_output.cc:41 | Do the semantics of the `save_each_rank` parameter still need to be preserved? |

Overall Assessment

The logic changes are generally reasonable, and the fix for the TP scenario is correct. One open question remains: if the speculative-decoding path supports EP (Expert Parallelism), removing the `save_each_rank` check would drop the outputs of non-rank-0 processes. Confirm whether EP is used with speculative decoding, or document in a comment that this change applies only to TP.

```diff
-  if (!save_each_rank && rank_id > 0) {
+  // NOTE(yaohuicong): Skip non-zero TP ranks — they share identical sampling
+  // outputs, so only rank 0 needs to send results to the message queue.
+  if (rank_id > 0) {
```
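A minimal sketch of what this gating change means, written as hedged Python helpers (the helper names are hypothetical; the real check lives inside the C++ custom op):

```python
def should_send_output_old(rank_id: int, save_each_rank: bool) -> bool:
    # Pre-fix behavior, mirroring `if (!save_each_rank && rank_id > 0) return;`:
    # a rank sends its output when save_each_rank is set, or when it is rank 0.
    return save_each_rank or rank_id == 0


def should_send_output_new(rank_id: int) -> bool:
    # Post-fix behavior, mirroring `if (rank_id > 0) return;`:
    # only rank 0 ever sends, regardless of save_each_rank.
    return rank_id == 0


# Under pure TP all ranks hold identical sampling results, so rank 0 alone
# sending is sufficient and avoids duplicate messages.
assert should_send_output_old(rank_id=0, save_each_rank=False)
assert should_send_output_new(rank_id=0)

# The behavioral difference the reviewer flags below: with save_each_rank=True,
# the old code let non-zero ranks send, which the new code no longer allows.
assert should_send_output_old(rank_id=2, save_each_rank=True)
assert not should_send_output_new(rank_id=2)
```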

❓ Question: the semantics of the `save_each_rank` parameter have been removed (the check changed from `if (!save_each_rank && rank_id > 0)` to `if (rank_id > 0)`), which means that regardless of the value of `save_each_rank`, only rank 0 sends results to the message queue.

Potential impact: if speculative decoding supports EP (Expert Parallelism), then in mixed EP + TP mode different EP ranks produce different outputs, and removing this check would drop the outputs of non-rank-0 processes.

Suggestions

  1. Confirm whether EP mode is supported in the speculative-decoding scenario.
  2. If EP is not supported, state in a comment that this change applies only to TP mode.
  3. If EP is supported, consider whether the `save_each_rank` check should be kept.

Also, the XPU version of `speculate_save_output.cc` (`custom_ops/xpu_ops/src/ops/mtp/speculate_save_output.cc`) still keeps the original logic; consider updating it as well to stay consistent across hardware backends.

```diff
   int64_t rank_id,
   bool save_each_rank) {
-  if (!save_each_rank && rank_id > 0) {
+  // NOTE(yaohuicong): Skip non-zero TP ranks — they share identical sampling
```

Same as above: the semantics of the `save_each_rank` parameter have been removed. If this function is ever called in EP mode, data could be lost.

```diff
-next_launch_token_num = (
-    seq_lens_this_time_cpu.sum().item() + is_block_step_cpu.sum().item() * token_num_one_step
-)
+next_launch_token_num = next_real_bsz * token_num_one_step
```

🟡 Suggestion: this simplification looks reasonable, because in speculative decoding each sequence should process a fixed number of tokens per step (token_num_one_step).

However, consider adding a comment explaining the problem with the original computation, for example:

# In MTP (Multi-Token Prediction) mode, each sequence processes a fixed number of
# tokens per step (num_speculative_tokens + 1), so we can simplify the calculation
# from seq_lens.sum() + is_block_step.sum() * token_num_one_step to
# next_real_bsz * token_num_one_step.
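To make the claimed equivalence concrete, here is an illustrative sketch (variable names mirror the diff; the concrete values are made up for the example) checking that when every running sequence contributes exactly token_num_one_step tokens, the old and new formulas agree:

```python
# Illustrative sketch, not FastDeploy code.
token_num_one_step = 2  # e.g. num_speculative_tokens + 1 in MTP

# Per-sequence token counts for the next launch: running sequences each
# process token_num_one_step tokens; a block-stepped sequence shows 0 here
# but rejoins the batch, so the old formula counted it via is_block_step.
seq_lens_this_time = [2, 2, 0, 2]           # third sequence is block-stepped
is_block_step = [False, False, True, False]

old = sum(seq_lens_this_time) + sum(is_block_step) * token_num_one_step

next_real_bsz = len(seq_lens_this_time)     # all four sequences run next step
new = next_real_bsz * token_num_one_step

# 6 + 1*2 == 4*2 == 8: under the fixed-tokens-per-step assumption the two
# formulas coincide, so the simpler one is preferable.
assert old == new == 8
```

The equivalence relies on the MTP invariant that every non-blocked sequence in the next batch has seq_len equal to token_num_one_step; if that invariant ever breaks, the two formulas diverge.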
